The purpose of this analysis is to explore a dataset featuring characteristics about red wine.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
str(RedWine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The red wine data set contains nearly 1600 observations of 13 variables.
We can see the distribution of quality ratings has a minimum of 3 and a maximum of 8, with most ratings at 5 or 6. Surprisingly, there are no ratings of 1, 2, 9, or 10. I would have expected a larger range of quality ratings with such a large data set.
I divided the data by quality level, with 0-3, 4-6, and 7-10 being the three levels. We can see that the vast majority of observations fall in the medium quality level.
We can see that the Density and pH plots are the most normally distributed. The majority of pH levels fall between 3.0 and 3.5. Many of the plots are skewed to the right, including Free Sulfur Dioxide and Total Sulfur Dioxide. The majority of wines have less than 100 in total sulfur dioxide. Several of the plots are long tailed, such as Residual Sugar and Chlorides.
The above plots compare the distributions before and after transformation. The data for residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide becomes more normally distributed after applying log10.
There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality). All of the variables are numeric with the exception of quality, which is in integer form.
Most of the wines have a quality of 5 or 6.
The 3rd quartile of residual sugar levels is 2.6, although there are a few major outliers, with the maximum residual sugar level of 15.5. I’m interested to see if higher residual sugar wines tend to have lower or higher quality.
Most wines have an alcohol content of less than 12%. This surprises me, given that the majority of red wines I’m familiar with have alcohol contents above 13.5%.
Many of the wines have 0 citric acid.
The main feature of interest in the dataset is quality, and I'd like to determine which variables impact quality ratings the most. I suspect alcohol, residual sugar, and pH contribute to quality ratings, as they seem to be features you may be able to detect during wine tastings.
From research into what contributes to the taste of wine, I discovered that sweetness, acidity, tannin, alcohol, and body are the main features. In addition to pH, I think fixed acidity and volatile acidity may contribute to the acidity of wine.
Yes, I created a new variable called quality level, which cuts the quality levels into low (3, 4), medium (5, 6), and high (7,8).
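The grouping described above can be sketched with base R's `cut()`. This is a minimal sketch rather than the original code: the vector holds only the first ten quality ratings shown in the `str()` output, and the break points and the name `quality.level` are assumptions chosen to reproduce the low (3, 4), medium (5, 6), high (7, 8) grouping.

```r
# First ten quality ratings, copied from the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed breaks: (0, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
# cut() uses right-closed intervals by default, so a rating of 4 is "low"
# and a rating of 6 is "medium".
quality.level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("low", "medium", "high"))

# Count the observations in each level
print(table(quality.level))
```

With only ten rows the "low" level is empty, which is why the full data set is needed to compare all three levels.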
I deleted column X because it was simply a repeat of the index.
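Dropping an index column such as `X` can be done by assigning `NULL` to it. The sketch below uses a three-row stand-in built from the first rows of the `str()` output. It only illustrates the idea and is not necessarily how the column was removed in this analysis.

```r
# Three-row stand-in for the full data frame (values from the str() output)
RedWine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Remove the redundant index column
RedWine$X <- NULL

print(names(RedWine))
```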
I applied log10 to residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide in order to normalize the distributions.
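As a rough illustration of why the log10 transform helps, the sketch below compares a simple, scale-free skew indicator before and after the transform. It uses only the first ten residual sugar values from the `str()` output, and the helper `median.skew` is a hypothetical name introduced here, not part of the original analysis.

```r
# First ten residual sugar values, copied from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: distance between mean and median in units of the
# standard deviation. Values near 0 suggest a roughly symmetric sample;
# larger positive values suggest a right skew.
median.skew <- function(x) (mean(x) - median(x)) / sd(x)

skew.raw <- median.skew(residual.sugar)
skew.log <- median.skew(log10(residual.sugar))

print(c(raw = skew.raw, log10 = skew.log))
```

On these ten rows the indicator is smaller after the transform, because the log scale pulls the single large value (6.1) closer to the rest of the sample.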
I analyzed the following bivariate relationships:
Quality vs. Alcohol
Quality vs. pH
Quality vs. Residual Sugar
Quality vs. Fixed Acidity
Quality vs. Volatile Acidity
Residual Sugar vs. Alcohol
Residual Sugar vs. pH
pH vs. Alcohol
Fixed Acidity vs. Density
Fixed Acidity vs. pH
pH vs. Citric Acid
Quality Level vs. Alcohol
Quality Level vs. pH
Quality Level vs. Residual Sugar
The correlogram indicates that most pairs of variables are not highly correlated. The strongest relationships appear to be density vs. fixed acidity (r = 0.67), citric acid vs. fixed acidity (r = 0.67), pH vs. fixed acidity (r = -0.68), and total sulfur dioxide vs. free sulfur dioxide (r = 0.67). A correlation between citric acid and fixed acidity is not surprising as they are both acids. Free sulfur dioxide is part of total sulfur dioxide, so a correlation is expected. pH is a measure of acidity, so the correlation between pH and fixed acidity is not surprising either. I am unsure what would cause a correlation between density and fixed acidity, but it could be that more acidic liquid is denser than less acidic liquid.
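Each cell of a correlogram is a pairwise Pearson correlation, which base R computes with `cor()`. The sketch below uses only the first ten rows shown in the `str()` output, so its value will differ from the full-data figures quoted above. It also shows that squaring the coefficient gives the coefficient of determination, which is never negative.

```r
# First ten values of each variable, copied from the str() output above
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson correlation coefficient r: signed, between -1 and 1
r <- cor(fixed.acidity, pH, method = "pearson")

# Coefficient of determination r^2: always between 0 and 1
r.squared <- r^2

print(c(r = r, r.squared = r.squared))
```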
It is interesting to see how the alcohol level tends to be much higher in the higher quality wines than the medium or low quality wines.
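The medians that a boxplot draws for each group can also be computed directly. The following is a sketch on the first ten rows from the `str()` output only, with assumed break points for the quality levels, so it is not representative of the full data set.

```r
# First ten rows, copied from the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Assumed grouping: low (3, 4), medium (5, 6), high (7, 8)
quality.level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("low", "medium", "high"))

# Median alcohol per quality level; levels with no rows come back as NA
median.alcohol <- tapply(alcohol, quality.level, median)

print(median.alcohol)
```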
The median pH level decreases as the quality increases. The data is also more compact at the highest quality level. As the quality level increases, the pH range decreases.
The range of outliers (plotted in red) is large in this plot, especially in the medium quality level. All of the outliers in all quality levels are high outliers; they have very high levels of residual sugar rather than very low levels of residual sugar.
Both the IQR and the median of volatile acidity decrease as quality level increases.
The main feature of interest in this analysis is quality, and whether any features show an effect on quality. The correlogram shows that the strongest correlation between quality and any other feature is with alcohol (r = 0.48). While you could see this trend in the alcohol vs. quality scatterplot, the large number of quality ratings of 5 and 6 made it difficult to see the true relationship between alcohol and quality in the higher quality wines. The quality level vs. alcohol boxplot shows this relationship much better, with a clear increase in median alcohol levels in the highest quality wines.
The correlation between quality and pH is only about 0.06 in magnitude, which indicates practically zero linear correlation, and the corresponding scatterplot confirms this. However, when pH is compared across quality levels, there is a pattern in the boxplot. The median pH levels seem to decrease as the quality level increases, especially between the lower quality and medium quality wines.
The correlogram indicates that there is essentially no correlation (r = 0.01) between residual sugar and quality. Even after transforming the residual sugar data using log10 and plotting it against quality levels, there seems to be no clear relationship between residual sugar and quality.
One of the most notable bivariate relationships discovered was the relationship between quality level and volatile acidity. The correlogram shows a correlation of r = -0.39 between quality and volatile acidity. The scatterplot shows this moderate correlation, but the relationship is much clearer when the data is grouped by quality level in the volatile acidity vs. quality level boxplot.
Some of the most interesting relationships were between variables that were not the main feature of interest. In fact, three of the four strongest correlations involved fixed acidity and another variable. Fixed acidity had the strongest correlations with density, citric acid, and pH. As discussed previously, this is not surprising given that many of the variables are either acids themselves or a measure of acidity.
The strongest relationship, according to the correlation coefficient, is between pH and fixed acidity (r = -0.68). However, once the data was cut into quality levels, the plots indicate that there are strong relationships between quality level and alcohol, quality level and volatile acidity, and quality level and pH.
Because the majority of the data has a medium quality level, the data is highly clustered. The smoother shows a slightly higher fixed acidity vs. density ratio for higher quality level wines vs. medium or lower quality level wines.
This plot doesn’t show strong trends, but it does show how the majority of the data falls in the lower alcohol, lower pH quadrant of the chart.
This plot shows some differences in the relationship between pH and citric acid by quality level.
The relationship between pH and fixed acidity seems to be uniform across all quality levels.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = RedWine)
## m2: lm(formula = quality ~ alcohol + pH, data = RedWine)
## m3: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)),
## data = RedWine)
## m4: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar, data = RedWine)
## m5: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity, data = RedWine)
## m6: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity, data = RedWine)
## m7: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)),
## data = RedWine)
## m8: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide, data = RedWine)
## m9: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid, data = RedWine)
## m10: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)),
## data = RedWine)
## m11: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)) +
## I(log10(free.sulfur.dioxide)), data = RedWine)
## m12: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)) +
## I(log10(free.sulfur.dioxide)) + sulphates, data = RedWine)
##
## ========================================================================================================================================================
##                                         m1        m2        m3        m4        m5        m6        m7        m8        m9       m10       m11       m12
## --------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                       1.875***  4.426***  4.526***  4.539***  3.014***  3.559***  3.971***  3.834***  3.901***  3.871***  4.036***  3.359***
##                                    (0.175)   (0.387)   (0.393)   (0.393)   (0.601)   (0.576)   (0.607)   (0.605)   (0.606)   (0.605)   (0.610)   (0.606)
## alcohol                           0.361***  0.386***  0.389***  0.391***  0.386***  0.330***  0.321***  0.325***  0.330***  0.321***  0.315***  0.292***
##                                    (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   (0.018)   (0.018)   (0.018)
## pH                                         -0.850*** -0.870*** -0.872***  -0.506**    -0.259    -0.288  -0.426**  -0.456**  -0.513**  -0.524**  -0.487**
##                                              (0.116)   (0.116)   (0.116)   (0.159)   (0.154)   (0.154)   (0.157)   (0.158)   (0.161)   (0.160)   (0.158)
## I(log10(residual.sugar))                                -0.171    -0.495   -0.728*    -0.199    -0.120    -0.142    -0.127    -0.084    -0.031     0.057
##                                                        (0.114)   (0.313)   (0.320)   (0.309)   (0.311)   (0.309)   (0.309)   (0.310)   (0.310)   (0.305)
## residual.sugar                                                     0.038     0.059     0.012     0.009     0.020     0.020     0.018     0.011     0.009
##                                                                  (0.034)   (0.035)   (0.033)   (0.033)   (0.033)   (0.033)   (0.033)   (0.033)   (0.033)
## fixed.acidity                                                             0.047***     0.023     0.018     0.009     0.022     0.018     0.018     0.017
##                                                                            (0.014)   (0.014)   (0.014)   (0.014)   (0.016)   (0.016)   (0.016)   (0.016)
## volatile.acidity                                                                   -1.249*** -1.255*** -1.233*** -1.332*** -1.282*** -1.253*** -1.114***
##                                                                                      (0.101)   (0.101)   (0.100)   (0.117)   (0.120)   (0.120)   (0.120)
## I(log10(total.sulfur.dioxide))                                                                 -0.124*   0.424**   0.405**   0.429**     0.210     0.103
##                                                                                                (0.058)   (0.144)   (0.145)   (0.145)   (0.178)   (0.175)
## total.sulfur.dioxide                                                                                   -0.006*** -0.005*** -0.006*** -0.005***  -0.004**
##                                                                                                          (0.001)   (0.001)   (0.001)   (0.001)   (0.001)
## citric.acid                                                                                                         -0.234    -0.170    -0.134    -0.226
##                                                                                                                    (0.142)   (0.146)   (0.147)   (0.145)
## I(log10(chlorides))                                                                                                           -0.248    -0.244 -0.552***
##                                                                                                                              (0.132)   (0.131)   (0.135)
## I(log10(free.sulfur.dioxide))                                                                                                           0.203*    0.198*
##                                                                                                                                        (0.095)   (0.094)
## sulphates                                                                                                                                       0.813***
##                                                                                                                                                  (0.108)
## --------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                            0.227     0.252     0.253     0.254     0.259     0.324     0.326     0.333     0.334     0.336     0.338     0.361
## adj. R-squared                       0.226     0.251     0.252     0.252     0.257     0.322     0.323     0.330     0.331     0.332     0.333     0.356
## sigma                                0.710     0.699     0.699     0.699     0.696     0.665     0.664     0.661     0.661     0.660     0.659     0.648
## F                                  468.267   268.888   180.161   135.446   111.286   127.233   109.959    99.324    88.685    80.298    73.574    74.568
## p                                    0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
## Log-likelihood                   -1721.057 -1694.466 -1693.325 -1692.710 -1687.117 -1613.455 -1611.152 -1602.601 -1601.237 -1599.456 -1597.172 -1568.959
## Deviance                           805.870   779.508   778.397   777.799   772.376   704.393   702.367   694.895   693.711   692.167   690.192   666.261
## AIC                               3448.114  3396.931  3396.650  3397.421  3388.234  3242.909  3240.303  3225.203  3224.475  3222.913  3220.344  3165.918
## BIC                               3464.245  3418.440  3423.536  3429.684  3425.874  3285.926  3288.697  3278.974  3283.623  3287.438  3290.247  3241.198
## N                                     1599      1599      1599      1599      1599      1599      1599      1599      1599      1599      1599      1599
## ========================================================================================================================================================
When looking at pH vs. citric acid in terms of quality levels, there does seem to be a relationship. For any given pH, the citric acid level appears to increase as the quality level increases. This relationship isn't as clear once the pH level increases above 3.5. This may be because there are fewer data points at that level.
In the density vs. fixed acidity in terms of quality level plot, there is a noticeable relationship. For a given fixed acidity level below 12, the average density level is higher for lower quality wines than for higher quality wines.
In the density vs. fixed acidity in terms of quality level plot, the smoother for the lowest quality wines does not appear linear. It almost appears logarithmic, with density rising more slowly as the fixed acidity value increases.
I created a model to predict quality with several variables, including alcohol, pH, residual sugar, fixed acidity, volatile acidity, total sulfur dioxide, citric acid, chlorides, free sulfur dioxide, and sulphates. The r^2 value of the model is 0.361. Because quality ratings are chosen by humans and are not scientific, an r^2 value of 0.361 is relatively strong.
The number of observations with quality ratings of 3, 4, 7, and 8 is much lower than the number of observations with quality ratings of 5 and 6. A bigger overall data set and more observations in the lower and higher quality ratings would improve the model.
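The incremental model building shown in the table can be sketched with `lm()` and `update()`. This is a minimal sketch on the first ten rows from the `str()` output only, so the coefficients and R-squared values it prints are not comparable to the table above. The data frame name `wine` is a stand-in chosen for this example.

```r
# First ten rows, copied from the str() output above
wine <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Start with a single predictor, then add one term at a time
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + pH)

# R-squared never decreases when a term is added to a nested linear model;
# adjusted R-squared penalises the extra term and can go down.
r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj.r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)

print(round(rbind(r2, adj.r2), 3))
```

Comparing adjusted R-squared (or AIC and BIC, as in the table) rather than plain R-squared helps judge whether an added variable is worth keeping.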
This plot shows a similar median alcohol level for both low and medium quality wines. Interestingly, the median alcohol level is much higher for the highest quality wines.
This plot indicates that the median pH level steadily decreases as the quality level increases.
This plot is notable because it visualizes the relationship between the two variables with the strongest correlation in the data set. Fixed acidity and pH have a correlation of r = -0.68. This relationship makes a lot of sense because pH is a measure of acidity; a lower pH indicates a substance is more acidic and a higher pH indicates a substance is more basic. This plot confirms this relationship.
I wanted to find out which variables most impacted quality. There were several insights I found during the exploration of this data set.
Alcohol and Quality: There is a clear relationship between quality level and alcohol content, but only for the highest quality wines. This relationship is unclear until quality is separated into quality levels.
pH and Quality: There is a negative relationship between pH and quality levels. This is unclear until quality is separated into quality levels.
Volatile Acidity and Quality: There is a very strong relationship between volatile acidity and quality levels. The relationship is only moderate when quality is not divided into levels.
The correlogram was very helpful in showing correlations between variables except for quality. In several cases, the relationship between quality and a specific variable was unclear until quality was separated into quality levels.
It would be most helpful to know the type of red wine, such as Cabernet Sauvignon, Pinot Noir, Merlot, etc. It is very difficult to analyze trends when the type of the wine is unknown. For example, a certain wine type may be expected to have more alcohol and therefore someone rating the quality of that wine would rate it more favorably than someone rating a quality of wine that was expected to have a lower alcohol content. Furthermore, it would be interesting to analyze wines from different parts of the world to see if there is a relationship between quality or any of the variables and region.